library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggridges)
We’ll be working with NOAA weather data, which is downloaded using
the rnoaa::meteo_pull_monitors function in the code chunk
below; similar code underlies the weather dataset used elsewhere in the
course. Because the download can take some time, I’ll cache the code
chunk.
weather_df =
rnoaa::meteo_pull_monitors(
c("USW00094728", "USW00022534", "USS0023B17S"),
var = c("PRCP", "TMIN", "TMAX"),
date_min = "2021-01-01",
date_max = "2022-12-31") |>
mutate(
name = recode(
id,
USW00094728 = "CentralPark_NY",
USW00022534 = "Molokai_HI",
USS0023B17S = "Waterhole_WA"),
tmin = tmin / 10,
tmax = tmax / 10) |>
select(name, id, everything())
## using cached file: /Users/EmilyMurphy/Library/Caches/org.R-project.R/R/rnoaa/noaa_ghcnd/USW00094728.dly
## date created (size, mb): 2023-09-29 15:32:29.663439 (8.525)
## file min/max dates: 1869-01-01 / 2023-09-30
## using cached file: /Users/EmilyMurphy/Library/Caches/org.R-project.R/R/rnoaa/noaa_ghcnd/USW00022534.dly
## date created (size, mb): 2023-09-29 15:32:48.450514 (3.83)
## file min/max dates: 1949-10-01 / 2023-09-30
## using cached file: /Users/EmilyMurphy/Library/Caches/org.R-project.R/R/rnoaa/noaa_ghcnd/USS0023B17S.dly
## date created (size, mb): 2023-09-29 15:32:49.230685 (0.994)
## file min/max dates: 1999-09-01 / 2023-09-30
weather_df
## # A tibble: 2,190 × 6
## name id date prcp tmax tmin
## <chr> <chr> <date> <dbl> <dbl> <dbl>
## 1 CentralPark_NY USW00094728 2021-01-01 157 4.4 0.6
## 2 CentralPark_NY USW00094728 2021-01-02 13 10.6 2.2
## 3 CentralPark_NY USW00094728 2021-01-03 56 3.3 1.1
## 4 CentralPark_NY USW00094728 2021-01-04 5 6.1 1.7
## 5 CentralPark_NY USW00094728 2021-01-05 0 5.6 2.2
## 6 CentralPark_NY USW00094728 2021-01-06 0 5 1.1
## 7 CentralPark_NY USW00094728 2021-01-07 0 5 -1
## 8 CentralPark_NY USW00094728 2021-01-08 0 2.8 -2.7
## 9 CentralPark_NY USW00094728 2021-01-09 0 2.8 -4.3
## 10 CentralPark_NY USW00094728 2021-01-10 0 5 -1.6
## # ℹ 2,180 more rows
To create a basic scatterplot, we need to map variables to the X and Y coordinate aesthetics:
ggplot(weather_df, aes(x = tmin, y = tmax))
Well, my “scatterplot” is blank. That’s because I’ve defined the data
and the aesthetic mappings, but haven’t added any geoms:
ggplot knows what data I want to plot and how I want to map
variables, but not what I want to show. Below I add a geom
to define my first scatterplot:
ggplot(weather_df, aes(x = tmin, y = tmax)) +
geom_point()
## Warning: Removed 17 rows containing missing values (`geom_point()`).
The code below could be used instead to produce the same figure. Using this style can be helpful if you want to do some pre-processing before making your plot but don’t want to save the intermediate data.
weather_df |>
ggplot(aes(x = tmin, y = tmax)) +
geom_point()
## Warning: Removed 17 rows containing missing values (`geom_point()`).
You can also save the output of ggplot() to an object and modify or print it later:
ggp_weather =
weather_df |>
ggplot(aes(x = tmin, y = tmax))
ggp_weather + geom_point()
## Warning: Removed 17 rows containing missing values (`geom_point()`).
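As a rough sketch of why saving the object works: a ggplot object is a stored specification, and `+` returns a new object rather than changing the saved one. The data frame `toy_df` and the object names below are made up for illustration and only stand in for `weather_df`.

```r
library(ggplot2)

# Made-up stand-in for weather_df (illustration only)
toy_df = data.frame(
  tmin = c(-2, 0, 3, 8, 12),
  tmax = c(4, 7, 9, 15, 21)
)

# Data and mappings only: no layers yet, so printing this gives a blank panel
ggp_base = ggplot(toy_df, aes(x = tmin, y = tmax))

# `+` returns a new plot object; ggp_base itself is left unchanged
ggp_points = ggp_base + geom_point()

# Count the layers stored in each object
length(ggp_base$layers)
length(ggp_points$layers)
```

The saved base object still has no layers after the addition, so it can be reused with different geoms.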
The basic scatterplot gave some useful information – the variables are related roughly as we’d expect, and there aren’t any obvious outliers to investigate before moving on. We do, however, have other variables to learn about using additional aesthetic mappings.
Let’s start with name, which I can incorporate using the
color aesthetic:
ggplot(weather_df, aes(x = tmin, y = tmax)) +
geom_point(aes(color = name))
## Warning: Removed 17 rows containing missing values (`geom_point()`).
We get colors and have a handy legend. Next I’ll add a smooth curve and make the data points a bit transparent.
ggplot(weather_df, aes(x = tmin, y = tmax)) +
geom_point(aes(color = name), alpha = .5) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 17 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 17 rows containing missing values (`geom_point()`).
The smooth curve is fit to all the data, but the colors apply only to the scatterplot; this is due to where I defined the mappings. The X and Y mappings apply to the whole graphic, but color is currently geom-specific. I’m also having a hard time seeing everything on one plot, so below I move color into the global mappings and add facets based on name.
ggplot(weather_df, aes(x = tmin, y = tmax, color = name)) +
geom_point(alpha = .5) +
geom_smooth(se = FALSE) +
facet_grid(. ~ name)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 17 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 17 rows containing missing values (`geom_point()`).
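The effect of where a mapping is defined can be checked directly. The sketch below uses simulated data (`toy_df`, with made-up station names A, B, and C) and a linear smooth to keep it small; it counts how many separate curves the smooth layer computes in each case.

```r
library(ggplot2)

# Simulated stand-in for weather_df (values are made up for illustration)
set.seed(1)
toy_df = data.frame(
  name = rep(c("A", "B", "C"), each = 30),
  tmin = rnorm(90, mean = 10, sd = 5)
)
toy_df$tmax = toy_df$tmin + 8 + rnorm(90)

# color mapped inside geom_point() only: the smooth layer never sees `name`
p_local =
  ggplot(toy_df, aes(x = tmin, y = tmax)) +
  geom_point(aes(color = name)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color mapped in ggplot(): every layer inherits it
p_global =
  ggplot(toy_df, aes(x = tmin, y = tmax, color = name)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Number of separate curves computed by the smooth layer (layer 2)
n_curves_local = length(unique(layer_data(p_local, 2)$group))
n_curves_global = length(unique(layer_data(p_global, 2)$group))
```

With the geom-specific mapping the smooth is a single curve; with the global mapping there is one curve per level of `name`.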
If I prefer something that shows the time of year and also want to learn about precipitation, I can put date on the X axis and map prcp to point size:
ggplot(weather_df, aes(x = date, y = tmax, color = name)) +
geom_point(aes(size = prcp), alpha = .5) +
geom_smooth(se = FALSE) +
facet_grid(. ~ name)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 17 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 19 rows containing missing values (`geom_point()`).
Write a code chain that starts with weather_df, focuses
only on Central Park, converts temperatures to Fahrenheit, makes a
scatterplot of min vs. max temperature, and overlays a linear regression
line (using options in geom_smooth()).
weather_df |>
filter(name == "CentralPark_NY") |>
mutate(
tmax_f = tmax * (9/5) + 32,
tmin_f = tmin * (9/5) + 32) |>
ggplot(aes(x = tmin_f, y = tmax_f)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
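The conversion used above is F = C × 9/5 + 32. One way to avoid repeating the formula for each column is a small helper; `c_to_f` below is a hypothetical name, not a function from any package loaded here.

```r
# Hypothetical helper (not from any package): Celsius to Fahrenheit
c_to_f = function(temp_c) {
  temp_c * (9 / 5) + 32
}

# Check against familiar reference points:
# freezing (0 C), boiling (100 C), and -40, where the two scales coincide
converted = c_to_f(c(0, 100, -40))
converted
```

Inside the chain, this could be used as `mutate(tmax_f = c_to_f(tmax), tmin_f = c_to_f(tmin))`.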
A different version of the same weather data:
ggplot(weather_df, aes(x = date, y = tmax, color = name)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 17 rows containing non-finite values (`stat_smooth()`).
When you’re making a scatterplot with lots of data, there’s a limit
to how much you can reduce overplotting using alpha levels
(transparency) alone. In these cases geom_hex(),
geom_bin2d(), or geom_density2d() can be
handy:
ggplot(weather_df, aes(x = tmax, y = tmin)) +
geom_hex()
## Warning: Removed 17 rows containing non-finite values (`stat_binhex()`).
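To see what binning does, here is a minimal sketch with geom_bin2d() on simulated data (`big_df` is made up and much denser than `weather_df`). Instead of drawing one point per row, the layer draws one tile per occupied bin and uses fill to show the count.

```r
library(ggplot2)

# Simulated data, made up for illustration
set.seed(1)
big_df = data.frame(tmin = rnorm(5000, mean = 10, sd = 8))
big_df$tmax = big_df$tmin + 8 + rnorm(5000, sd = 3)

p_bins =
  ggplot(big_df, aes(x = tmin, y = tmax)) +
  geom_bin2d(bins = 30)

# The computed layer data has one row per occupied bin, not one per observation
binned = layer_data(p_bins)
nrow(binned)
```

geom_bin2d() is part of ggplot2 itself, whereas geom_hex() also needs the hexbin package installed.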
There are lots of aesthetics, and these depend to some extent on the
geom; color worked for both geom_point() and
geom_smooth(), but shape only applies to
points. The help page for each geom includes a list of understood
aesthetics.
ggplot(weather_df) + geom_point(aes(x = tmax, y = tmin), color = "blue")
## Warning: Removed 17 rows containing missing values (`geom_point()`).
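In the line above, color = "blue" sits outside aes(), so it is a fixed setting rather than a mapping. The sketch below (with a made-up `toy_df`) contrasts the two placements by looking at the colors ggplot actually computes for the points.

```r
library(ggplot2)

# Made-up stand-in for weather_df (illustration only)
toy_df = data.frame(tmin = c(-2, 0, 3, 8), tmax = c(4, 7, 9, 15))

# Setting: color is a fixed property of the layer
p_set =
  ggplot(toy_df) +
  geom_point(aes(x = tmax, y = tmin), color = "blue")

# Mapping: "blue" is treated as a one-level variable and passed through the color scale
p_mapped =
  ggplot(toy_df) +
  geom_point(aes(x = tmax, y = tmin, color = "blue"))

# Colors used for the points in each plot
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```

The first plot draws blue points. The second draws points in the first color of the default discrete scale and adds a legend with a single entry labeled "blue", which is rarely what was intended.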
Histograms, density plots, boxplots, and violin plots are for understanding the distribution of single variables.
ggplot(weather_df, aes(x = tmax)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 17 rows containing non-finite values (`stat_bin()`).
You can play around with things like the bin width, and set the fill color using an aesthetic mapping:
ggplot(weather_df, aes(x = tmax, fill = name)) +
geom_histogram(position = "dodge", binwidth = 2)
## Warning: Removed 17 rows containing non-finite values (`stat_bin()`).
Setting position = "dodge" places the bars for each group
side-by-side, but this gets sort of hard to understand. I often prefer
density plots in place of histograms.
ggplot(weather_df, aes(x = tmax, fill = name)) +
geom_density(alpha = .4, adjust = .5, color = "blue")
## Warning: Removed 17 rows containing non-finite values (`stat_density()`).
The adjust parameter in density plots is similar to the
binwidth parameter in histograms, and it helps to try a few
values. I set the transparency level to .4 to make sure all densities
appear. You should also note the distinction between fill
and color aesthetics here. You could facet by
name as above but would have to ask if that makes
comparisons easier or harder. Lastly, adding geom_rug() to
a density plot can be a helpful way to show the raw data in addition to
the density.
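A minimal sketch of the geom_rug() suggestion, using simulated data (`toy_df`, with made-up groups A and B) in place of `weather_df`:

```r
library(ggplot2)

# Simulated stand-in for weather_df (values are made up for illustration)
set.seed(1)
toy_df = data.frame(
  name = rep(c("A", "B"), each = 50),
  tmax = c(rnorm(50, mean = 5, sd = 4), rnorm(50, mean = 25, sd = 2))
)

p_rug =
  ggplot(toy_df, aes(x = tmax, fill = name)) +
  # adjust < 1 narrows the default bandwidth, giving a less smooth curve
  geom_density(alpha = .4, adjust = .5) +
  # one tick per observation along the X axis, colored by group
  geom_rug(aes(color = name))

# The rug layer (layer 2) keeps one row per raw observation
n_rug_ticks = nrow(layer_data(p_rug, 2))
```

Here fill controls the shaded area under each density, while color controls lines: the rug ticks in this sketch, and the density outlines in the plot above.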
ggplot(weather_df, aes(x = name, y = tmax)) + geom_boxplot()
## Warning: Removed 17 rows containing non-finite values (`stat_boxplot()`).
ggplot(weather_df, aes(x = name, y = tmax)) +
geom_violin(aes(fill = name), alpha = .5) +
stat_summary(fun = "median", color = "blue")
## Warning: Removed 17 rows containing non-finite values (`stat_ydensity()`).
## Warning: Removed 17 rows containing non-finite values (`stat_summary()`).
## Warning: Removed 3 rows containing missing values (`geom_segment()`).
Ridge plots are a replacement for both boxplots and violin plots. They’re
implemented in the ggridges package, and are nice if you
have lots of categories in which the shape of the distribution
matters.
ggplot(weather_df, aes(x = tmax, y = name)) +
geom_density_ridges(scale = .85)
## Picking joint bandwidth of 1.54
## Warning: Removed 17 rows containing non-finite values
## (`stat_density_ridges()`).
Make plots that compare precipitation across locations. Try a histogram, a density plot, a boxplot, a violin plot, and a ridgeplot; use aesthetic mappings to make your figure readable.
ggplot(weather_df, aes(x = prcp)) +
geom_density(aes(fill = name), alpha = .5)
## Warning: Removed 15 rows containing non-finite values (`stat_density()`).
ggplot(weather_df, aes(x = prcp, y = name)) +
geom_density_ridges(scale = .85)
## Picking joint bandwidth of 9.22
## Warning: Removed 15 rows containing non-finite values
## (`stat_density_ridges()`).
ggplot(weather_df, aes(y = prcp, x = name)) +
geom_boxplot()
## Warning: Removed 15 rows containing non-finite values (`stat_boxplot()`).
This is a tough variable to plot because of the highly skewed distribution in each location. Of these, I’d probably choose the boxplot because it shows the outliers most clearly. If the “bulk” of the data were interesting, I’d probably complement this with a plot showing data for all precipitation less than 100, or for data omitting days with no precipitation.
weather_df |>
filter(prcp > 0) |>
ggplot(aes(x = prcp, y = name)) +
geom_density_ridges(scale = .85)
## Picking joint bandwidth of 20.6
Don’t use the built-in “Export” button, because then the figure isn’t
reproducible: no one will know how the plot was exported. Instead, use
ggsave() by explicitly creating the figure and exporting;
ggsave() infers the file type from the extension of the file name you
give it and has options for specifying features of the plot, such as
width and height. In this setting, it’s often helpful
to save the ggplot object explicitly and then export it
(using relative paths!).
ggp_weather =
ggplot(weather_df, aes(x = tmin, y = tmax)) +
geom_point(aes(color = name), alpha = .5)
ggsave("ggp_weather.pdf", ggp_weather, width = 8, height = 5)
## Warning: Removed 17 rows containing missing values (`geom_point()`).
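The same pattern as a self-contained sketch. The data frame `toy_df` and the file name are made up, and the file is written under tempdir() only so the example leaves nothing behind in your project.

```r
library(ggplot2)

# Made-up stand-in for weather_df (illustration only)
toy_df = data.frame(tmin = c(-2, 0, 3, 8), tmax = c(4, 7, 9, 15))

# Save the plot object explicitly so the export does not depend on what was last printed
ggp_toy =
  ggplot(toy_df, aes(x = tmin, y = tmax)) +
  geom_point()

# In a real project, a relative path such as "results/ggp_toy.pdf" (hypothetical)
# would be the usual choice
out_file = file.path(tempdir(), "ggp_toy.pdf")

# The ".pdf" extension selects the output format; width and height are in inches by default
ggsave(out_file, plot = ggp_toy, width = 8, height = 5)

file.exists(out_file)
```

Passing the plot object explicitly matters because ggsave() otherwise saves the last plot displayed, which may not be the one you meant.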
Embedding plots in an R Markdown document can also take a while to
get used to, because there are several things to adjust. First is the
size of the figure created by R, which is controlled using two of the
three chunk options fig.width, fig.height, and
fig.asp. I prefer a common width and plots that are a
little wider than they are tall, so I set options to
fig.width = 6 and fig.asp = .6. Second is the
size of the figure inserted into your document, which is controlled
using out.width or out.height. I like to have
a little padding around the sides of my figures, so I set
out.width = "90%". I do all this by including the following
in a code snippet at the outset of my R Markdown documents.
knitr::opts_chunk$set(
fig.width = 6,
fig.asp = .6,
out.width = "90%"
)
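Only two of the three options are needed because the third follows from them: when fig.asp is given, knitr computes the figure height as fig.width * fig.asp. A quick check of the arithmetic for the settings above (the variable names here are just for illustration):

```r
# Values from the chunk options above
fig_width = 6   # inches
fig_asp = .6    # height / width

# Height implied by the two settings, in inches
fig_height = fig_width * fig_asp
fig_height
```

So each figure is created at 6 by 3.6 inches and then displayed at 90% of the available width in the document.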
What makes embedding figures difficult at first is that things like
the font and point size in the figures generated by R are constant –
that is, they don’t scale with the overall size of the figure. As a
result, text in a figure with width 12 will look smaller than text in a
figure with width 6 after both have been embedded in a document. As an
example, the code chunk below has set fig.width = 12.
ggplot(weather_df, aes(x = tmin, y = tmax)) +
geom_point(aes(color = name))
## Warning: Removed 17 rows containing missing values (`geom_point()`).